Background

The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis of historical weather data from the METAR archives hosted by Iowa State University.

Data have been downloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.

This module will perform initial modeling on the processed weather files. It builds on the previous ‘WeatherModeling_202006_v001.Rmd’ and leverages functions in ‘WeatherModelingFunctions_v001.R’.

Data Availability

There are three main processed files available for further exploration:

metar_postEDA_20200617.rds and metar_postEDA_extra_20200627.rds

  • source (chr) - the reporting station and year (e.g., “kdtw_2016”)
  • locale (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • origMETAR (chr) - the original METAR associated with the observation at that source and date-time
  • year (dbl) - the year, extracted from dtime
  • monthint (dbl) - the month, extracted from dtime, as an integer
  • month (fct) - the month, extracted from dtime, as a three-character abbreviation (factor)
  • day (int) - the day of the month, extracted from dtime
  • WindDir (chr) - prevailing wind direction in degrees, stored as a character since ‘VRB’ means variable
  • WindSpeed (int) - the prevailing wind speed in knots
  • WindGust (dbl) - the wind gust speed in knots (NA if there is no recorded wind gust at that hour)
  • predomDir (fct) - the predominant wind direction, one of NE, E, SE, S, SW, W, NW, N, VRB, 000, or Error
  • Visibility (dbl) - surface visibility in statute miles
  • Altimeter (dbl) - altimeter in inches of mercury
  • TempF (dbl) - the Fahrenheit temperature
  • DewF (dbl) - the Fahrenheit dew point
  • modSLP (dbl) - Sea-Level Pressure (SLP) in hPa, adjusted because SLP is encoded in the range 0-1000 even though actual values fall in roughly 950-1050
  • cTypen (chr) - the cloud type of the nth cloud layer (FEW, BKN, SCT, OVC, or VV)
  • cLeveln (dbl) - the cloud height in feet of the nth cloud layer
  • isRain (lgl) - was rain occurring at the moment the METAR was captured?
  • isSnow (lgl) - was snow occurring at the moment the METAR was captured?
  • isThunder (lgl) - was thunder occurring at the moment the METAR was captured?
  • p1Inches (dbl) - how many inches of rain occurred in the past hour?
  • p36Inches (dbl) - how many inches of rain occurred in the past 3 or 6 hours (3-hour summaries at 3Z, 9Z, 15Z, and 21Z; 6-hour summaries at 6Z, 12Z, 18Z, and 24Z; NA at all other Z times)
  • p24Inches (dbl) - how many inches of rain occurred in the past 24 hours (at 12Z, NA at all other times)
  • tempFHi (dbl) - the high temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • tempFLo (dbl) - the low temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • minHeight (dbl) - the minimum cloud height in feet (-100 means ‘no clouds’)
  • minType (fct) - amount of obscuration at the minimum cloud height (VV > OVC > BKN > SCT > FEW > CLR)
  • ceilingHeight (dbl) - the minimum cloud ceiling in feet (-100 means ‘no ceiling’)
  • ceilingType (fct) - amount of obscuration at the minimum ceiling height (VV > OVC > BKN)
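
The obscuration orderings above (VV > OVC > BKN > SCT > FEW > CLR) can be encoded as ordered factors so that comparisons work directly. A minimal sketch (not the actual processing code):

```r
# Ordered factor encoding the obscuration ranking, least to most obscured
types <- c("OVC", "CLR", "BKN")
f <- factor(types,
            levels = c("CLR", "FEW", "SCT", "BKN", "OVC", "VV"),
            ordered = TRUE)

# Ordered factors support direct comparison: OVC > BKN
f[1] > f[3]
```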

metar_modifiedClouds_20200617.rds and metar_modifiedclouds_extra_20200627.rds

  • source (chr) - the reporting station and year
  • sourceName (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • level (dbl) - cloud level (level 0 is inserted for every source-dtime as a base layer of clear)
  • height (dbl) - level height (height -100 is inserted for every source-dtime as a base layer of clear)
  • type (chr) - level type (type CLR is inserted for every source-dtime as a base layer of clear)

metar_precipLists_20200617.rds and metar_precipLists_extra_20200627.rds

  • Contains elements for each of rain/snow/thunder for each of 2015/2016/2017
  • Each element contains a list and a tibble
  • The tibble is precipLength and contains precipitation by month as source-locale-month-hours-events
  • The list is precipList and contains data on each precipitation interval

Several mapping files are defined for use in plotting; tidyverse, lubridate, and caret are loaded; and the relevant functions are sourced:

# The process frequently uses tidyverse, lubridate, caret, and randomForest
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"


# Sourcing functions
source("./WeatherModelingFunctions_v001.R")


# Descriptive names for key variables
varMapper <- c(source="Original source file name", 
               locale="Descriptive name",
               dtime="Date-Time (UTC)",
               origMETAR="Original METAR",
               year="Year",
               monthint="Month",
               month="Month", 
               day="Day of Month",
               WindDir="Wind Direction (degrees)", 
               WindSpeed="Wind Speed (kts)",
               WindGust="Wind Gust (kts)",
               predomDir="General Prevailing Wind Direction",
               Visibility="Visibility (SM)", 
               Altimeter="Altimeter (inches Hg)",
               TempF="Temperature (F)",
               DewF="Dew Point (F)", 
               modSLP="Sea-Level Pressure (hPa)", 
               cType1="First Cloud Layer Type", 
               cLevel1="First Cloud Layer Height (ft)",
               isRain="Rain at Observation Time",
               isSnow="Snow at Observation Time",
               isThunder="Thunder at Observation Time",
               tempFHi="24-hour High Temperature (F)",
               tempFLo="24-hour Low Temperature (F)",
               minHeight="Minimum Cloud Height (ft)",
               minType="Obscuration Level at Minimum Cloud Height",
               ceilingHeight="Minimum Ceiling Height (ft)",
               ceilingType="Obscuration Level at Minimum Ceiling Height", 
               hr="Hour of Day (Zulu time)"
               )


# File name to city name mapper
cityNameMapper <- c(katl_2016="Atlanta, GA (2016)",
                    kbos_2016="Boston, MA (2016)", 
                    kdca_2016="Washington, DC (2016)", 
                    kden_2016="Denver, CO (2016)", 
                    kdfw_2016="Dallas, TX (2016)", 
                    kdtw_2016="Detroit, MI (2016)", 
                    kewr_2016="Newark, NJ (2016)",
                    kgrb_2016="Green Bay, WI (2016)",
                    kgrr_2016="Grand Rapids, MI (2016)",
                    kiah_2016="Houston, TX (2016)",
                    kind_2016="Indianapolis, IN (2016)",
                    klas_2014="Las Vegas, NV (2014)",
                    klas_2015="Las Vegas, NV (2015)",
                    klas_2016="Las Vegas, NV (2016)", 
                    klas_2017="Las Vegas, NV (2017)", 
                    klas_2018="Las Vegas, NV (2018)",
                    klas_2019="Las Vegas, NV (2019)",
                    klax_2016="Los Angeles, CA (2016)", 
                    klnk_2016="Lincoln, NE (2016)",
                    kmia_2016="Miami, FL (2016)", 
                    kmke_2016="Milwaukee, WI (2016)",
                    kmsn_2016="Madison, WI (2016)",
                    kmsp_2016="Minneapolis, MN (2016)",
                    kmsy_2014="New Orleans, LA (2014)",
                    kmsy_2015="New Orleans, LA (2015)",
                    kmsy_2016="New Orleans, LA (2016)", 
                    kmsy_2017="New Orleans, LA (2017)", 
                    kmsy_2018="New Orleans, LA (2018)",
                    kmsy_2019="New Orleans, LA (2019)",
                    kord_2014="Chicago, IL (2014)",
                    kord_2015="Chicago, IL (2015)",
                    kord_2016="Chicago, IL (2016)", 
                    kord_2017="Chicago, IL (2017)", 
                    kord_2018="Chicago, IL (2018)",
                    kord_2019="Chicago, IL (2019)",
                    kphl_2016="Philadelphia, PA (2016)", 
                    kphx_2016="Phoenix, AZ (2016)", 
                    ksan_2014="San Diego, CA (2014)",
                    ksan_2015="San Diego, CA (2015)",
                    ksan_2016="San Diego, CA (2016)",
                    ksan_2017="San Diego, CA (2017)",
                    ksan_2018="San Diego, CA (2018)",
                    ksan_2019="San Diego, CA (2019)",
                    ksat_2016="San Antonio, TX (2016)", 
                    ksea_2016="Seattle, WA (2016)", 
                    ksfo_2016="San Francisco, CA (2016)", 
                    ksjc_2016="San Jose, CA (2016)",
                    kstl_2016="Saint Louis, MO (2016)", 
                    ktpa_2016="Tampa Bay, FL (2016)", 
                    ktvc_2016="Traverse City, MI (2016)"
                    )

# File names in 2016, based on cityNameMapper
names_2016 <- grep(names(cityNameMapper), pattern="[a-z]{3}_2016", value=TRUE)
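
Named character vectors such as varMapper and cityNameMapper act as simple lookup tables: indexing by name returns the descriptive label. A minimal illustration using two entries copied from varMapper:

```r
# Named character vectors double as lookup tables
varMapperDemo <- c(TempF = "Temperature (F)", DewF = "Dew Point (F)")

# Index by element name; unname() drops the name attribute
unname(varMapperDemo["TempF"])

# Vectorized lookup also works
unname(varMapperDemo[c("TempF", "DewF")])
```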

The main data will be from the metar_postEDA files. They are integrated below, and cloud and ceiling heights are converted to factors:

# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds") %>%
    bind_rows(readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_extra_20200627.rds")) %>%
    mutate(orig_minHeight=minHeight, 
           orig_ceilingHeight=ceilingHeight, 
           minHeight=mapCloudHeight(minHeight), 
           ceilingHeight=mapCloudHeight(ceilingHeight)
           )
glimpse(metarData)
## Observations: 437,417
## Variables: 43
## $ source             <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_201...
## $ locale             <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Det...
## $ dtime              <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-...
## $ origMETAR          <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 R...
## $ year               <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ monthint           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month              <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ WindDir            <chr> "230", "230", "230", "240", "230", "220", "220",...
## $ WindSpeed          <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, ...
## $ WindGust           <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, ...
## $ predomDir          <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, ...
## $ Visibility         <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, ...
## $ Altimeter          <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14,...
## $ TempF              <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92,...
## $ DewF               <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94,...
## $ modSLP             <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, ...
## $ cType1             <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC",...
## $ cType2             <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC"...
## $ cType3             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType4             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType5             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType6             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cLevel1            <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ cLevel2            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, ...
## $ cLevel3            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel4            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel5            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel6            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ isRain             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isSnow             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isThunder          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ p1Inches           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, N...
## $ p36Inches          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA...
## $ p24Inches          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFHi            <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFLo            <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, ...
## $ minHeight          <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ minType            <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ ceilingHeight      <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ ceilingType        <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ orig_minHeight     <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ orig_ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
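
mapCloudHeight() is sourced from WeatherModelingFunctions_v001.R and is not shown here. A hypothetical sketch of such a bucketing function follows; the cut points and labels below are assumptions for illustration, not the actual definition (the only constraints visible above are that -100 means ‘no clouds’ and that heights around 2000 ft map to "Low"):

```r
# Hypothetical sketch only; the real mapCloudHeight() is defined in
# WeatherModelingFunctions_v001.R. Cut points and labels are assumed.
mapCloudHeightSketch <- function(height) {
    cut(height,
        breaks = c(-Inf, -100, 3000, 10000, Inf),
        labels = c("None", "Low", "Mid", "High"),
        right = TRUE)
}

mapCloudHeightSketch(c(-100, 2000, 5000, 20000))
```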

Initial Exploration

An initial exploration of the data can use hierarchical clustering based on several numerical features of the data (temperature, dew point, sea-level pressure, wind speed) by month.

Example code includes:

# Create hierarchical clustering for 2016 data
# Distance is calculated only where month is the same
distData <- metarData %>% 
    filter(year==2016) %>%
    select(locale, month, TempF, DewF, modSLP, WindSpeed) %>%
    group_by(locale, month) %>% 
    summarize_if(is.numeric, mean, na.rm=TRUE) %>% 
    ungroup() %>% 
    mutate(rowname=paste0(locale, "_", month)) %>% 
    column_to_rownames() %>% 
    select(-locale, -month) %>% 
    as.matrix() %>% 
    scale() %>% 
    dist() %>% 
    as.matrix() %>% 
    as.data.frame() %>%
    rownames_to_column(var="locale1") %>% 
    pivot_longer(-locale1, names_to="locale2", values_to="dist") %>% 
    mutate(city1=str_replace(locale1, "_\\w{3}", ""), 
           city2=str_replace(locale2, "_\\w{3}", ""), 
           month1=str_sub(locale1, -3), 
           month2=str_sub(locale2, -3)) %>%
    filter(month1==month2) %>% 
    group_by(city1, city2) %>% 
    summarize(meandist=mean(dist), sddist=sd(dist)) %>% 
    ungroup() %>% 
    select(-sddist) %>% 
    pivot_wider(id_cols=city1, names_from="city2", values_from="meandist") %>%
    column_to_rownames(var="city1") %>%
    as.matrix() %>%
    as.dist(diag=FALSE)

distData %>%
    hclust(method="complete") %>%
    plot()

distData %>%
    hclust(method="single") %>%
    plot()

At a glance, there are several sensible findings from the clustering:

  • Two pairs of cities in close geographic proximity (Newark and Philadelphia; Chicago and Milwaukee) show the greatest similarity in the clustering
  • A cluster of coastal California cities (San Diego, Los Angeles, San Jose) show strong similarities
  • A cluster of hot-humid cities (Houston-New Orleans and then soon Miami-Tampa) show strong similarities
  • Several clusters of cold-weather cities emerge
  • Two desert cities (Las Vegas, Phoenix) show similarities

There are a handful of cities that do not seem to cluster with anything else:

  • Denver, Boston, Seattle, San Francisco
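
Beyond reading the dendrograms by eye, cluster memberships can be extracted with cutree(). A self-contained sketch on a toy distance matrix standing in for distData:

```r
# Toy symmetric distance matrix for four cities (stands in for distData):
# city1/city2 are near each other, as are city3/city4
m <- matrix(c(0, 1, 5, 6,
              1, 0, 5, 6,
              5, 5, 0, 1,
              6, 6, 1, 0),
            nrow = 4,
            dimnames = list(paste0("city", 1:4), paste0("city", 1:4)))

# Complete-linkage clustering, then cut the tree into k = 2 groups
cutree(hclust(as.dist(m), method = "complete"), k = 2)
# city1 city2 city3 city4
#     1     1     2     2
```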

Temperature and Dew Point

Suppose the only information available about two cities were their temperatures, dew points, and the month of observation. How well would a basic random forest, with mtry=3, classify the two cities?
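
rfTwoLocales() is defined in WeatherModelingFunctions_v001.R; the body below is a minimal sketch of what such a pairwise classifier might look like, not the sourced implementation:

```r
library(dplyr)
library(randomForest)

# Hypothetical sketch of a pairwise locale classifier; the real
# rfTwoLocales() lives in WeatherModelingFunctions_v001.R
rfTwoLocalesSketch <- function(dat, loc1, loc2, vrbls, ntree = 25) {
    d <- dat %>%
        filter(source %in% c(loc1, loc2)) %>%       # keep only the two locales
        select(all_of(c("locale", vrbls))) %>%      # response plus predictors
        na.omit() %>%                               # drop incomplete rows
        mutate(locale = factor(locale))             # response must be a factor
    randomForest(locale ~ ., data = d, ntree = ntree, mtry = 3)
}
```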

The function rfTwoLocales() is then run for every pairwise combination of 2016 locales in cityNameMapper, with a common random seed applied across all runs:

# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))

set.seed(2006281443)

n <- 1
for (ctr in 1:(length(names_2016)-1)) {
    for (ctr2 in (ctr+1):length(names_2016)) {
        list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData, 
                                                        loc1=names_2016[ctr], 
                                                        loc2=names_2016[ctr2], 
                                                        vrbls=c("TempF", "DewF", "month"),
                                                        ntree=25
                                                        )
        n <- n + 1
        if ((n %% 50) == 0) { cat("Through number:", n, "\n")}
    }
}
## Through number: 50 
## Through number: 100 
## Through number: 150 
## Through number: 200 
## Through number: 250 
## Through number: 300 
## Through number: 350 
## Through number: 400
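
The nested for-loops above enumerate all choose(n, 2) unordered pairs of locales; combn() generates the same pairs more compactly:

```r
# combn() enumerates the same unordered pairs as the nested for-loops
names_demo <- c("katl_2016", "kbos_2016", "kdca_2016")
pairs <- combn(names_demo, 2, simplify = FALSE)

length(pairs)   # choose(3, 2) = 3
pairs[[1]]      # "katl_2016" "kbos_2016"
```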

Accuracy can then be assessed:

# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)

# Assess the top 10 classification accuracies
acc_TempF_DewF_month %>%
    arrange(-accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1              locale2                 accOverall accLocale1 accLocale2
##    <chr>                <chr>                        <dbl>      <dbl>      <dbl>
##  1 Denver, CO (2016)    Miami, FL (2016)             0.997      0.997      0.997
##  2 Las Vegas, NV (2016) Miami, FL (2016)             0.990      0.991      0.989
##  3 Miami, FL (2016)     Seattle, WA (2016)           0.985      0.985      0.985
##  4 Denver, CO (2016)    Tampa Bay, FL (2016)         0.984      0.986      0.982
##  5 Miami, FL (2016)     Traverse City, MI (201~      0.980      0.981      0.980
##  6 Miami, FL (2016)     Minneapolis, MN (2016)       0.978      0.978      0.978
##  7 Boston, MA (2016)    Miami, FL (2016)             0.976      0.975      0.976
##  8 Denver, CO (2016)    New Orleans, LA (2016)       0.975      0.975      0.975
##  9 Miami, FL (2016)     Phoenix, AZ (2016)           0.974      0.973      0.975
## 10 Green Bay, WI (2016) Miami, FL (2016)             0.971      0.972      0.971
# Assess the bottom 10 classification accuracies
acc_TempF_DewF_month %>%
    arrange(accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Newark, NJ (2016)      Philadelphia, PA (20~      0.580      0.591      0.570
##  2 Chicago, IL (2016)     Milwaukee, WI (2016)       0.591      0.600      0.583
##  3 Chicago, IL (2016)     Madison, WI (2016)         0.598      0.622      0.573
##  4 Detroit, MI (2016)     Grand Rapids, MI (20~      0.608      0.568      0.649
##  5 Chicago, IL (2016)     Detroit, MI (2016)         0.610      0.633      0.587
##  6 Grand Rapids, MI (201~ Traverse City, MI (2~      0.613      0.630      0.597
##  7 Madison, WI (2016)     Milwaukee, WI (2016)       0.618      0.625      0.611
##  8 Green Bay, WI (2016)   Madison, WI (2016)         0.619      0.586      0.652
##  9 Detroit, MI (2016)     Madison, WI (2016)         0.619      0.639      0.598
## 10 Chicago, IL (2016)     Grand Rapids, MI (20~      0.621      0.647      0.595
allAccuracy_month <- select(acc_TempF_DewF_month, 
                            locale=locale1, 
                            other=locale2, 
                            accOverall, 
                            accLocale=accLocale1
                            ) %>%
    bind_rows(select(acc_TempF_DewF_month, 
                     locale=locale2, 
                     other=locale1, 
                     accOverall, 
                     accLocale=accLocale2
                     )
              )

# Overall accuracy by location plot
allAccuracy_month %>%
    group_by(locale) %>%
    summarize_if(is.numeric, mean) %>%
    ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) + 
    geom_point(size=2) + 
    geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
    coord_flip() + 
    labs(x="", 
         y="Average Accuracy", 
         title="Average Accuracy Predicting Locale",
         subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
         caption="Temperature, Dew Point, Month as predictors\n(50% is baseline coinflip accuracy)"
         ) + 
    ylim(c(0.5, 1))

# Overall accuracy heatmap (FIX for better ordering of locales)
# allAccuracy_month %>% 
#     ggplot(aes(x=locale, y=other)) + 
#     geom_tile(aes(fill=accOverall)) + 
#     theme(axis.text.x=element_text(angle=90)) + 
#     scale_fill_continuous("Accuracy", high="darkblue", low="white") + 
#     labs(title="Accuracy Predicting Locale vs. Locale", 
#          caption="Temperature, Dew Point, and Month as predictors\n(50% is baseline coinflip accuracy)",
#          x="",
#          y=""
#          )

Accuracy is best for pairs of cities with significantly different climates. The model is especially successful at pulling apart the desert cities (Las Vegas, Phoenix) and the humid Florida cities (Miami, Tampa Bay) from everything else.

Next, the simple model is run to classify locale across the full 2016 dataset:

# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData, 
                                 vrbls=c("TempF", "DewF", "month"),
                                 locs=names_2016, 
                                 ntree=50, 
                                 seed=2006281508
                                 )

Summaries can then be created for the accuracy in predicting each locale:

evalPredictions(rf_all_2016_TDm, plotCaption = "Temp, Dew Point, Month")

## # A tibble: 898 x 5
##    locale             predicted               correct     n    pct
##    <fct>              <fct>                   <lgl>   <int>  <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE      484 0.182 
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      49 0.0185
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      39 0.0147
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     179 0.0675
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      34 0.0128
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      42 0.0158
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      48 0.0181
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      42 0.0158
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      96 0.0362
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      49 0.0185
## # ... with 888 more rows

Accuracy is meaningfully higher than the null accuracy, though it remains under 50% for every locale. This suggests that a better goal is to define a few archetype climates (e.g., humid, cold, desert) and to predict against those instead.

Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold-weather cities on the downwind side of the Great Lakes, and further separate the various marine and mid-continental climates.

An updated random forest model is then run:

# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(metarData, 
                                  vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
                                  locs=names_2016, 
                                  ntree=50, 
                                  seed=2006281509
                                  )

The evaluation process can again be run:

evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")

## # A tibble: 897 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE      643 0.240  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      84 0.0313 
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      61 0.0227 
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     145 0.0541 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      44 0.0164 
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      51 0.0190 
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      49 0.0183 
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      26 0.00969
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      82 0.0306 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE     100 0.0373 
## # ... with 887 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDm, currObject=rf_all_2016_TDmc)
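
deltaAccuracy() is also sourced from WeatherModelingFunctions_v001.R. Conceptually it joins per-locale accuracies from two runs and reports the change; a toy sketch of that idea (the function body and column names are assumptions):

```r
library(dplyr)

# Hypothetical sketch: the real deltaAccuracy() is sourced from
# WeatherModelingFunctions_v001.R; column names here are assumed
deltaAccuracySketch <- function(prevAcc, currAcc) {
    inner_join(prevAcc, currAcc, by = "locale",
               suffix = c("_prev", "_curr")) %>%
        mutate(delta = acc_curr - acc_prev) %>%   # improvement per locale
        arrange(desc(delta))                      # biggest gains first
}
```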

Accuracy increases meaningfully, though still well under 50% in most cases.

The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:

# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(metarData, 
                                   vrbls=c("TempF", "DewF", 
                                           "month", 
                                           "minHeight", "ceilingHeight", 
                                           "WindSpeed", "predomDir"
                                           ),
                                   locs=names_2016, 
                                   ntree=50, 
                                   seed=2006281934
                                   )

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcw, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
                )

## # A tibble: 889 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE     1096 0.415  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      34 0.0129 
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      46 0.0174 
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     135 0.0511 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      28 0.0106 
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      31 0.0117 
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      26 0.00984
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      24 0.00908
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      58 0.0219 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      59 0.0223 
## # ... with 879 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmc, currObject=rf_all_2016_TDmcw)

Including wind significantly improves model accuracy for many locales. Even the cold-weather cities are now being predicted with around 30%-40% accuracy.

The model can also be built out to consider hour of day and sea-level pressure. No attempt yet is made to control for over-fitting, with the exception that mtry is held at 4:

# Run random forest for all 2016 locales
rf_all_2016_TDmcwa <- rfMultiLocale(metarData, 
                                    vrbls=c("TempF", "DewF", 
                                            "month",
                                            "minHeight", "ceilingHeight", 
                                            "WindSpeed", "predomDir",
                                            "modSLP"
                                            ),
                                    locs=names_2016, 
                                    ntree=50, 
                                    mtry=4,
                                    seed=2006282103
                                    )

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcwa, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 880 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE     1639 0.625  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      16 0.00610
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      23 0.00877
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE      75 0.0286 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE       8 0.00305
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      18 0.00686
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      12 0.00457
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE       6 0.00229
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      45 0.0172 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      51 0.0194 
## # ... with 870 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmcw, currObject=rf_all_2016_TDmcwa)

Adding sea-level pressure again significantly improves predictive accuracy. The cold-weather cities are in the roughly 40%-60% accuracy range, and the other cities are in the roughly 60%-80% range.

Based on the hierarchical clustering and results of the initial random forest models, a few archetype cities are defined in an attempt to pull out distinctions:

  • Miami (likely to match with Tampa, New Orleans, and Houston)
  • Las Vegas (likely to match with Phoenix)
  • Los Angeles (likely to match with San Diego and San Jose)
  • Denver (seems potentially isolated)
  • Seattle (seems potentially isolated)
  • Dallas (likely to match to San Antonio)
  • Minneapolis (surrogate for the cold side of cold weather cities)
  • Atlanta (possibly matching DC, St. Louis, and Indianapolis, but may be too similar to Cold/Miami)

A data subset is created, with “hr” added for the Zulu hour of the observation (as integer rather than as factor):

archeCities <- c("kmia_2016", "klas_2016", "klax_2016", "kden_2016", 
                 "ksea_2016", "kdfw_2016", "kmsp_2016", "katl_2016"
                 )

archeData <- metarData %>%
    filter(source %in% archeCities) %>%
    mutate(hr=lubridate::hour(dtime))
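
lubridate::hour() returns the hour of day as an integer 0-23, which is why hr stays numeric rather than becoming a factor:

```r
library(lubridate)

# hour() extracts the integer hour (0-23) from a date-time
hour(ymd_hms("2016-01-01 23:53:00"))  # 23
```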

A random forest, with mtry=4, is then run using all variables from previous, as well as hour:

# Run random forest for all 2016 locales
rf_arche_2016_TDmcwa <- rfMultiLocale(archeData, 
                                      vrbls=c("TempF", "DewF", 
                                              "month", "hr",
                                              "minHeight", "ceilingHeight", 
                                              "WindSpeed", "predomDir",
                                              "modSLP"
                                              ),
                                      locs=NULL, # defaults to running for all locales
                                      ntree=100, 
                                      mtry=4,
                                      seed=2006291353
                                      )
## 
## Running for locations:
## [1] "katl_2016" "kden_2016" "kdfw_2016" "klas_2016" "klax_2016" "kmia_2016"
## [7] "kmsp_2016" "ksea_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 62 x 5
##    locale             predicted              correct     n    pct
##    <fct>              <fct>                  <lgl>   <int>  <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)     TRUE     2204 0.835 
##  2 Atlanta, GA (2016) Dallas, TX (2016)      FALSE     139 0.0527
##  3 Atlanta, GA (2016) Denver, CO (2016)      FALSE      39 0.0148
##  4 Atlanta, GA (2016) Las Vegas, NV (2016)   FALSE      33 0.0125
##  5 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE      86 0.0326
##  6 Atlanta, GA (2016) Miami, FL (2016)       FALSE      50 0.0189
##  7 Atlanta, GA (2016) Minneapolis, MN (2016) FALSE      45 0.0170
##  8 Atlanta, GA (2016) Seattle, WA (2016)     FALSE      44 0.0167
##  9 Dallas, TX (2016)  Atlanta, GA (2016)     FALSE     203 0.0789
## 10 Dallas, TX (2016)  Dallas, TX (2016)      TRUE     2105 0.818 
## # ... with 52 more rows

The model is reasonably successful in pulling apart the archetypes. Denver and Dallas seem potentially less distinct, though all archetype cities are predicted with 80%+ accuracy.

The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa, 
                  fullData=metarData, 
                  archeCities=archeCities, 
                  sortDescMatch=TRUE
                  )

There appear to be several issues:

  • Atlanta and Dallas are not particularly distinct and do not make for good archetypes (potentially, Miami, Atlanta, and Dallas can all be replaced by Houston); Denver and Seattle are shaky but may be OK
  • Minneapolis may be too cold, driving the midwestern cities to instead classify partially as Denver
  • The model is potentially learning too much about the specific city rather than about archetypes

The first two issues are addressed by collapsing Atlanta, Dallas, and Miami into a single archetype represented by Houston, and by replacing Minneapolis with Chicago:

archeCities_v02 <- c("kiah_2016", "klas_2016", "klax_2016", 
                     "kden_2016", "ksea_2016", "kord_2016"
                     )

archeData_v02 <- metarData %>%
    filter(source %in% archeCities_v02) %>%
    mutate(hr=lubridate::hour(dtime))

A random forest, with mtry=4, is then run using all variables from the previous models, as well as hour:

# Run random forest for all 2016 locales in archeData_v02
rf_arche_2016_TDmcwa_v02 <- rfMultiLocale(archeData_v02, 
                                          vrbls=c("TempF", "DewF", 
                                                  "month", "hr",
                                                  "minHeight", "ceilingHeight", 
                                                  "WindSpeed", "predomDir",
                                                  "modSLP"
                                                  ),
                                          locs=NULL, # defaults to running for all locales
                                          ntree=100, 
                                          mtry=4,
                                          seed=2006291448
                                          )
## 
## Running for locations:
## [1] "kden_2016" "kiah_2016" "klas_2016" "klax_2016" "kord_2016" "ksea_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v02, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 36 x 5
##    locale             predicted              correct     n     pct
##    <fct>              <fct>                  <lgl>   <int>   <dbl>
##  1 Chicago, IL (2016) Chicago, IL (2016)     TRUE     2321 0.872  
##  2 Chicago, IL (2016) Denver, CO (2016)      FALSE      99 0.0372 
##  3 Chicago, IL (2016) Houston, TX (2016)     FALSE      54 0.0203 
##  4 Chicago, IL (2016) Las Vegas, NV (2016)   FALSE      13 0.00489
##  5 Chicago, IL (2016) Los Angeles, CA (2016) FALSE      70 0.0263 
##  6 Chicago, IL (2016) Seattle, WA (2016)     FALSE     104 0.0391 
##  7 Denver, CO (2016)  Chicago, IL (2016)     FALSE     119 0.0453 
##  8 Denver, CO (2016)  Denver, CO (2016)      TRUE     2282 0.869  
##  9 Denver, CO (2016)  Houston, TX (2016)     FALSE       3 0.00114
## 10 Denver, CO (2016)  Las Vegas, NV (2016)   FALSE     126 0.0480 
## # ... with 26 more rows

The model is reasonably successful in pulling apart the archetypes. Houston appears the most distinct (as Miami did previously), and there is less overlap to/from Denver and Seattle.

The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa_v02, 
                  fullData=metarData, 
                  archeCities=archeCities_v02, 
                  sortDescMatch = TRUE
                  )

At a glance, the archetypes are more sensible. There is a bit of Denver and a bit of Seattle in most of the cold weather cities, and a bit of Houston in some of the warmer cold weather cities. This suggests either that there is more than one type of cold-weather city, or that cold-weather cities tend to resemble other archetypes at certain times.

The model is run again, deleting Denver and Seattle, and converting Los Angeles to San Diego:

archeCities_v03 <- c("kiah_2016", "klas_2016", "ksan_2016", "kord_2016")

archeData_v03 <- metarData %>%
    filter(source %in% archeCities_v03) %>%
    mutate(hr=lubridate::hour(dtime))

A random forest, with mtry=4, is then run using all variables from the previous models, as well as hour:

# Run random forest for all 2016 locales in archeData_v03
rf_arche_2016_TDmcwa_v03 <- rfMultiLocale(archeData_v03, 
                                          vrbls=c("TempF", "DewF", 
                                                  "month", "hr",
                                                  "minHeight", "ceilingHeight", 
                                                  "WindSpeed", "predomDir",
                                                  "modSLP"
                                                  ),
                                          locs=NULL, # defaults to running for all locales
                                          ntree=100, 
                                          mtry=4,
                                          seed=2006291459
                                          )
## 
## Running for locations:
## [1] "kiah_2016" "klas_2016" "kord_2016" "ksan_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v03, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 16 x 5
##    locale               predicted            correct     n     pct
##    <fct>                <fct>                <lgl>   <int>   <dbl>
##  1 Chicago, IL (2016)   Chicago, IL (2016)   TRUE     2511 0.946  
##  2 Chicago, IL (2016)   Houston, TX (2016)   FALSE      69 0.0260 
##  3 Chicago, IL (2016)   Las Vegas, NV (2016) FALSE      26 0.00979
##  4 Chicago, IL (2016)   San Diego, CA (2016) FALSE      49 0.0185 
##  5 Houston, TX (2016)   Chicago, IL (2016)   FALSE      44 0.0170 
##  6 Houston, TX (2016)   Houston, TX (2016)   TRUE     2473 0.954  
##  7 Houston, TX (2016)   Las Vegas, NV (2016) FALSE      19 0.00733
##  8 Houston, TX (2016)   San Diego, CA (2016) FALSE      57 0.0220 
##  9 Las Vegas, NV (2016) Chicago, IL (2016)   FALSE      29 0.0113 
## 10 Las Vegas, NV (2016) Houston, TX (2016)   FALSE      13 0.00508
## 11 Las Vegas, NV (2016) Las Vegas, NV (2016) TRUE     2482 0.971  
## 12 Las Vegas, NV (2016) San Diego, CA (2016) FALSE      33 0.0129 
## 13 San Diego, CA (2016) Chicago, IL (2016)   FALSE      35 0.0130 
## 14 San Diego, CA (2016) Houston, TX (2016)   FALSE      29 0.0108 
## 15 San Diego, CA (2016) Las Vegas, NV (2016) FALSE      63 0.0235 
## 16 San Diego, CA (2016) San Diego, CA (2016) TRUE     2559 0.953

These archetypes are distinct, with accuracies all in the 95% range. The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa_v03,
                  fullData=metarData,
                  archeCities=archeCities_v03,
                  sortDescMatch=TRUE
                  )

Quite a few cities show as blends of the archetype cities, which is likely correct if archetypes are considered to be Cold, Humid, Marine, Desert. An open question is whether to expand that list with another 1-2 archetypes that attempt to better capture Atlanta, Dallas, Denver, St Louis, and DC.

The Marine archetype may need to be converted to Pacific to better identify the California entities.

A decision about Seattle is needed, as it is a little like everything and everything is a little like it.

To address the issue of the model learning from specific cities rather than archetypes, user-defined clusters will be created, and data from these clusters will be used in modeling:

  • Marine - San Diego, Los Angeles in an attempt to train the model to recognize that Marine cities differ meaningfully from Cold cities and Humid cities (San Jose, San Francisco, and Seattle are somewhat different climates)
  • Desert - Las Vegas, Phoenix in an attempt to train the model on differences in Desert and Marine
  • Humid - Houston, Miami, New Orleans, Tampa, reflecting that the model is already largely pulling out the humid cities correctly
  • Cold - Chicago, Milwaukee, Grand Rapids, Green Bay, Traverse City, Madison, Detroit, Minneapolis to reflect that the model is largely already pulling out cold cities correctly (Boston excluded for now)
  • South - Atlanta, Dallas, San Antonio in an attempt to pull out southern cities that are not as humid as those along the Gulf Coast
  • Mixed - Indianapolis, Lincoln, St Louis, to reflect cities that have milder winters than Cold but colder winters than South
  • Excluded for now - Denver, Seattle, Boston/Newark/Philadelphia/DC, San Francisco/San Jose

Further, a variable will be created for “hr”, the Zulu hour of the observation:

# Create the locale mapper
locMapperTibble <- tibble::tribble(
    ~source, ~locType,
    'katl_2016', 'South',
    'kbos_2016', 'Exclude',
    'kdca_2016', 'Exclude',
    'kden_2016', 'Exclude',
    'kdfw_2016', 'South',
    'kdtw_2016', 'Cold',
    'kewr_2016', 'Exclude',
    'kgrb_2016', 'Cold',
    'kgrr_2016', 'Cold',
    'kiah_2016', 'Humid',
    'kind_2016', 'Mixed',
    'klas_2016', 'Desert',
    'klax_2016', 'Marine',
    'klnk_2016', 'Mixed',
    'kmia_2016', 'Humid',
    'kmke_2016', 'Cold',
    'kmsn_2016', 'Cold',
    'kmsp_2016', 'Cold',
    'kmsy_2016', 'Humid',
    'kord_2016', 'Cold',
    'kphl_2016', 'Exclude',
    'kphx_2016', 'Desert',
    'ksan_2016', 'Marine',
    'ksat_2016', 'South',
    'ksea_2016', 'Exclude',
    'ksfo_2016', 'Exclude',
    'ksjc_2016', 'Exclude',
    'kstl_2016', 'Mixed',
    'ktpa_2016', 'Humid',
    'ktvc_2016', 'Cold'
)

# Create locMapper
locMapper <- locMapperTibble %>% pull(locType)
names(locMapper) <- locMapperTibble %>% pull(source)
locMapper
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016 
##   "South" "Exclude" "Exclude" "Exclude"   "South"    "Cold" "Exclude"    "Cold" 
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016 
##    "Cold"   "Humid"   "Mixed"  "Desert"  "Marine"   "Mixed"   "Humid"    "Cold" 
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016 
##    "Cold"    "Cold"   "Humid"    "Cold" "Exclude"  "Desert"  "Marine"   "South" 
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016 
## "Exclude" "Exclude" "Exclude"   "Mixed"   "Humid"    "Cold"
# Create the data file with locType
locData_v04 <- metarData %>% 
    mutate(locType=locMapper[source], hr=lubridate::hour(dtime)) %>% 
    filter(year==2016, locType !="Exclude")

# Counts by locType
locData_v04 %>%
    count(locType)
## # A tibble: 6 x 2
##   locType     n
##   <chr>   <int>
## 1 Cold    70101
## 2 Desert  17543
## 3 Humid   35074
## 4 Marine  17536
## 5 Mixed   26187
## 6 South   26309
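The locMapper[source] step used above relies on R's named-vector indexing, which acts as a simple lookup table. A minimal illustration with toy vectors (not the project data):

# Toy lookup table: names are sources, values are locale types
lookup <- c(kord_2016="Cold", klas_2016="Desert", kmia_2016="Humid")

# Indexing by a character vector returns one value per element, repeats included
sources <- c("klas_2016", "kord_2016", "kord_2016")
unname(lookup[sources])
## [1] "Desert" "Cold"   "Cold"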

There is a risk that the model will predict neutral cities as ‘Cold’ given the roughly 4:1 ratio of Cold observations to Marine or Desert observations. The random forest model is first run on the data ‘as is’:

rf_arche_small_TDmcwa_v04 <- rfMultiLocale(locData_v04, 
                                           vrbls=c("TempF", "DewF", 
                                                   "month", "hr",
                                                   "minHeight", "ceilingHeight", 
                                                   "WindSpeed", "predomDir",
                                                   "modSLP"
                                                   ),
                                           locs=NULL, # defaults to running for all locales
                                           ntree=100, 
                                           locVar="locType", 
                                           pred="locType",
                                           mtry=4, 
                                           otherVar=c("dtime", "source", "locale"),
                                           seed=2006301313
                                           )
## 
## Running for locations:
## [1] "Cold"   "Desert" "Humid"  "Marine" "Mixed"  "South"

Prediction accuracy is then assessed:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v04,
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
                keyVar="locType"
                )

## # A tibble: 36 x 5
##    locale predicted correct     n     pct
##    <fct>  <fct>     <lgl>   <int>   <dbl>
##  1 Cold   Cold      TRUE    19819 0.947  
##  2 Cold   Desert    FALSE      37 0.00177
##  3 Cold   Humid     FALSE     119 0.00568
##  4 Cold   Marine    FALSE     159 0.00759
##  5 Cold   Mixed     FALSE     549 0.0262 
##  6 Cold   South     FALSE     254 0.0121 
##  7 Desert Cold      FALSE     122 0.0234 
##  8 Desert Desert    TRUE     4790 0.921  
##  9 Desert Humid     FALSE      41 0.00788
## 10 Desert Marine    FALSE     115 0.0221 
## # ... with 26 more rows

The designation of Cold, Desert, Humid, and Marine seems successful. The attempt to split South from Humid resulted in good, if incomplete, separation. The attempt to separate a group of ‘warmer’ cold-weather cities from ‘colder’ cold-weather cities was not particularly successful. This could partially be an artifact of ‘Cold’ having more than double the data volume of ‘Mixed’.

Classifications by city can also be examined:

archeCities_v04 <- locMapperTibble %>%
    filter(locType != "Exclude") %>%
    pull(source)

localeByArchetype(rf_arche_small_TDmcwa_v04, 
                  fullData=mutate(metarData, locType=locMapper[source]), 
                  archeCities=archeCities_v04, 
                  sortDescMatch=TRUE
                  )

Many of the cities are nicely classified into their assigned archetypes, as desired. Among the cities used in the classification, concerns include:

  • The cities pre-assigned as South do not cluster strongly together; this may not be a good archetype
  • The cities pre-assigned as Mixed do not cluster well, possibly driven by a data volume discrepancy in which Cold has more than double the data
  • Remaining cities frequently assign to 2+ archetypes, as expected since weather in the mid-latitudes would not necessarily follow any archetypal pattern on a year-long basis

Classifications are changed somewhat, and data are then filtered so that an equal number of observations from each locale type are applied in the modeling:

  • Mixed - deleted, entries will be moved to ‘Exclude’ to see how they map
  • South - deleted, entries will be moved to ‘Exclude’ to see how they map

# Create the locale mapper
locMapperTibble_v05 <- tibble::tribble(
    ~source, ~locType,
    'katl_2016', 'Exclude',
    'kbos_2016', 'Exclude',
    'kdca_2016', 'Exclude',
    'kden_2016', 'Exclude',
    'kdfw_2016', 'Exclude',
    'kdtw_2016', 'Cold',
    'kewr_2016', 'Exclude',
    'kgrb_2016', 'Cold',
    'kgrr_2016', 'Cold',
    'kiah_2016', 'Humid',
    'kind_2016', 'Exclude',
    'klas_2016', 'Desert',
    'klax_2016', 'Marine',
    'klnk_2016', 'Exclude',
    'kmia_2016', 'Humid',
    'kmke_2016', 'Cold',
    'kmsn_2016', 'Cold',
    'kmsp_2016', 'Cold',
    'kmsy_2016', 'Humid',
    'kord_2016', 'Cold',
    'kphl_2016', 'Exclude',
    'kphx_2016', 'Desert',
    'ksan_2016', 'Marine',
    'ksat_2016', 'Exclude',
    'ksea_2016', 'Exclude',
    'ksfo_2016', 'Exclude',
    'ksjc_2016', 'Exclude',
    'kstl_2016', 'Exclude',
    'ktpa_2016', 'Humid',
    'ktvc_2016', 'Cold'
)

# Create locMapper
locMapper_v05 <- locMapperTibble_v05 %>% pull(locType)
names(locMapper_v05) <- locMapperTibble_v05 %>% pull(source)
locMapper_v05
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016 
## "Exclude" "Exclude" "Exclude" "Exclude" "Exclude"    "Cold" "Exclude"    "Cold" 
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016 
##    "Cold"   "Humid" "Exclude"  "Desert"  "Marine" "Exclude"   "Humid"    "Cold" 
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016 
##    "Cold"    "Cold"   "Humid"    "Cold" "Exclude"  "Desert"  "Marine" "Exclude" 
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016 
## "Exclude" "Exclude" "Exclude" "Exclude"   "Humid"    "Cold"
# Create the data file with locType
locData_v05 <- metarData %>% 
    mutate(locType=locMapper_v05[source], hr=lubridate::hour(dtime)) %>% 
    filter(year==2016, locType !="Exclude")

# Counts by locType
locData_v05 %>%
    count(locType)
## # A tibble: 4 x 2
##   locType     n
##   <chr>   <int>
## 1 Cold    70101
## 2 Desert  17543
## 3 Humid   35074
## 4 Marine  17536

The data are then filtered so that there is an equal number of observations from each locale type. The model can later be used to predict against the remaining items:

# Set a seed for reproducibility
set.seed(2006301356)

# Find the smallest locale type
nSmall <- locData_v05 %>%
    filter(!is.na(locType)) %>%
    count(locType) %>%
    pull(n) %>%
    min()

# Create the relevant data subset
subData_v05 <- locData_v05 %>%
    filter(!is.na(locType)) %>%
    group_by(locType) %>%
    sample_n(size=nSmall, replace=FALSE) %>%
    ungroup()

# Summarize the data subset
subData_v05 %>% 
    count(locType, locale) %>% 
    arrange(-n)
## # A tibble: 16 x 3
##    locType locale                       n
##    <chr>   <chr>                    <int>
##  1 Marine  Los Angeles, CA (2016)    8774
##  2 Desert  Phoenix, AZ (2016)        8770
##  3 Desert  Las Vegas, NV (2016)      8766
##  4 Marine  San Diego, CA (2016)      8762
##  5 Humid   New Orleans, LA (2016)    4402
##  6 Humid   Miami, FL (2016)          4388
##  7 Humid   Houston, TX (2016)        4386
##  8 Humid   Tampa Bay, FL (2016)      4360
##  9 Cold    Grand Rapids, MI (2016)   2234
## 10 Cold    Milwaukee, WI (2016)      2231
## 11 Cold    Minneapolis, MN (2016)    2205
## 12 Cold    Madison, WI (2016)        2188
## 13 Cold    Traverse City, MI (2016)  2182
## 14 Cold    Detroit, MI (2016)        2173
## 15 Cold    Chicago, IL (2016)        2168
## 16 Cold    Green Bay, WI (2016)      2155

The previous model can then be run on the data subset:

rf_arche_small_TDmcwa_v05 <- rfMultiLocale(subData_v05, 
                                           vrbls=c("TempF", "DewF", 
                                                   "month", "hr",
                                                   "minHeight", "ceilingHeight", 
                                                   "WindSpeed", "predomDir",
                                                   "modSLP"
                                                   ),
                                           locs=NULL, # defaults to running for all locales
                                           ntree=100, 
                                           locVar="locType", 
                                           pred="locType",
                                           mtry=4, 
                                           otherVar=c("dtime", "source", "locale"),
                                           seed=2006301358
                                           )
## 
## Running for locations:
## [1] "Cold"   "Desert" "Humid"  "Marine"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v05,
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
                keyVar="locType"
                )

## # A tibble: 16 x 5
##    locale predicted correct     n     pct
##    <fct>  <fct>     <lgl>   <int>   <dbl>
##  1 Cold   Cold      TRUE     4864 0.931  
##  2 Cold   Desert    FALSE      68 0.0130 
##  3 Cold   Humid     FALSE     116 0.0222 
##  4 Cold   Marine    FALSE     175 0.0335 
##  5 Desert Cold      FALSE      58 0.0112 
##  6 Desert Desert    TRUE     4925 0.955  
##  7 Desert Humid     FALSE      34 0.00659
##  8 Desert Marine    FALSE     141 0.0273 
##  9 Humid  Cold      FALSE      99 0.0188 
## 10 Humid  Desert    FALSE      64 0.0121 
## 11 Humid  Humid     TRUE     4916 0.932  
## 12 Humid  Marine    FALSE     194 0.0368 
## 13 Marine Cold      FALSE      88 0.0166 
## 14 Marine Desert    FALSE     210 0.0396 
## 15 Marine Humid     FALSE      81 0.0153 
## 16 Marine Marine    TRUE     4920 0.928

The archetypes are well separated, with roughly 93%-96% accuracy in every case. Predictions can then also be made for all cities, including those not in the modeling:

archeCities_v05 <- locMapperTibble_v05 %>%
    filter(locType != "Exclude") %>%
    pull(source)

localeByArchetype(rf_arche_small_TDmcwa_v05, 
                  fullData=mutate(metarData, locType=locMapper_v05[source]), 
                  archeCities=archeCities_v05, 
                  sortDescMatch=TRUE
                  )

Broadly speaking, cities used in modeling appear to classify into the appropriate archetypes. For the cities not included in the modeling:

  • Boston, Lincoln, Indianapolis, Newark, Philadelphia, and Seattle are most closely matched to ‘Cold’
  • San Antonio is most closely matched to ‘Humid’
  • Denver is a split of Cold/Desert
  • DC and St Louis are a split of Cold/Humid
  • San Francisco and San Jose are most closely matched to ‘Marine’ with meaningful ‘Cold’ and ‘Humid’ also
  • Atlanta and Dallas are roughly split between ‘Humid’ and ‘Cold’, with a bit of ‘Desert’ also

The model can be applied to data from years other than 2016, to see how the classifications are impacted by use in out-of-sample years:

# Predictions on non-2016 data
helperPredictPlot(rf_arche_small_TDmcwa_v05$rfModel, 
                  df=filter(mutate(metarData, hr=lubridate::hour(dtime)), year!=2016), 
                  predOrder=c("Marine", "Humid", "Desert", "Cold"), 
                  locMapper=locMapper_v05
                  )

## # A tibble: 80 x 4
##    locale             predicted     n    pct
##    <chr>              <fct>     <int>  <dbl>
##  1 Chicago, IL (2014) Cold       7928 0.915 
##  2 Chicago, IL (2014) Desert      273 0.0315
##  3 Chicago, IL (2014) Humid       144 0.0166
##  4 Chicago, IL (2014) Marine      318 0.0367
##  5 Chicago, IL (2015) Cold       7722 0.886 
##  6 Chicago, IL (2015) Desert      247 0.0283
##  7 Chicago, IL (2015) Humid       340 0.0390
##  8 Chicago, IL (2015) Marine      411 0.0471
##  9 Chicago, IL (2017) Cold       7660 0.878 
## 10 Chicago, IL (2017) Desert      466 0.0534
## # ... with 70 more rows

Model performance on non-2016 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the archetypes.

Suppose that models are run on all 2015-2018 data for Chicago, Las Vegas, New Orleans, and San Diego:

# Create the subset for Chicago, Las Vegas, New Orleans, San Diego
sub_2015_2018_data <- metarData %>%
    filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"), 
           year %in% c(2015, 2016, 2017, 2018)
           ) %>%
    mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""), 
           hr=lubridate::hour(dtime)
           )

# Check that proper locales are included
sub_2015_2018_data %>% 
    count(city, locale)
## # A tibble: 16 x 3
##    city            locale                     n
##    <chr>           <chr>                  <int>
##  1 Chicago, IL     Chicago, IL (2015)      8728
##  2 Chicago, IL     Chicago, IL (2016)      8767
##  3 Chicago, IL     Chicago, IL (2017)      8740
##  4 Chicago, IL     Chicago, IL (2018)      8737
##  5 Las Vegas, NV   Las Vegas, NV (2015)    8727
##  6 Las Vegas, NV   Las Vegas, NV (2016)    8770
##  7 Las Vegas, NV   Las Vegas, NV (2017)    8664
##  8 Las Vegas, NV   Las Vegas, NV (2018)    8746
##  9 New Orleans, LA New Orleans, LA (2015)  8727
## 10 New Orleans, LA New Orleans, LA (2016)  8767
## 11 New Orleans, LA New Orleans, LA (2017)  8723
## 12 New Orleans, LA New Orleans, LA (2018)  8737
## 13 San Diego, CA   San Diego, CA (2015)    8737
## 14 San Diego, CA   San Diego, CA (2016)    8762
## 15 San Diego, CA   San Diego, CA (2017)    8744
## 16 San Diego, CA   San Diego, CA (2018)    8744
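The str_replace() call above can be sanity-checked on a single locale string; the pattern matches a space, any character (the opening parenthesis), four digits, and any character (the closing parenthesis):

# Verify that the year suffix is stripped from a locale string
stringr::str_replace("Chicago, IL (2016)", pattern=" .\\d{4}.", replacement="")
## [1] "Chicago, IL"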

The random forest model is run and cached:

# Run random forest for 2015-2018 data
rf_types_2015_2018_TDmcwha <- rfMultiLocale(sub_2015_2018_data, 
                                            vrbls=c("TempF", "DewF", 
                                                    "month", "hr",
                                                    "minHeight", "ceilingHeight", 
                                                    "WindSpeed", "predomDir", 
                                                    "modSLP"
                                                    ),
                                            locs=NULL, 
                                            locVar="city",
                                            pred="city",
                                            ntree=50, 
                                            seed=2006301420, 
                                            mtry=4
                                            )
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2018_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="city"
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 16 x 5
##    locale          predicted       correct     n     pct
##    <fct>           <fct>           <lgl>   <int>   <dbl>
##  1 Chicago, IL     Chicago, IL     TRUE     9847 0.941  
##  2 Chicago, IL     Las Vegas, NV   FALSE     121 0.0116 
##  3 Chicago, IL     New Orleans, LA FALSE     236 0.0226 
##  4 Chicago, IL     San Diego, CA   FALSE     258 0.0247 
##  5 Las Vegas, NV   Chicago, IL     FALSE     153 0.0146 
##  6 Las Vegas, NV   Las Vegas, NV   TRUE    10048 0.961  
##  7 Las Vegas, NV   New Orleans, LA FALSE      60 0.00574
##  8 Las Vegas, NV   San Diego, CA   FALSE     192 0.0184 
##  9 New Orleans, LA Chicago, IL     FALSE     224 0.0213 
## 10 New Orleans, LA Las Vegas, NV   FALSE      98 0.00932
## 11 New Orleans, LA New Orleans, LA TRUE     9859 0.938  
## 12 New Orleans, LA San Diego, CA   FALSE     329 0.0313 
## 13 San Diego, CA   Chicago, IL     FALSE     173 0.0166 
## 14 San Diego, CA   Las Vegas, NV   FALSE     281 0.0269 
## 15 San Diego, CA   New Orleans, LA FALSE     197 0.0189 
## 16 San Diego, CA   San Diego, CA   TRUE     9789 0.938

Even with a small forest (50 trees), the model almost always separates Las Vegas, Chicago, San Diego, and New Orleans. While these cities have very different climates, it is striking that the model makes so few misclassifications.

How do other cities map against these classifications?

# Predictions on 2015/2018 data
helperPredictPlot(rf_types_2015_2018_TDmcwha$rfModel, 
                  df=filter(mutate(metarData, hr=lubridate::hour(dtime)), 
                            !(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))
                            ), 
                  predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
                  )

## # A tibble: 104 x 4
##    locale             predicted           n    pct
##    <chr>              <fct>           <int>  <dbl>
##  1 Atlanta, GA (2016) Chicago, IL      3363 0.384 
##  2 Atlanta, GA (2016) Las Vegas, NV     809 0.0924
##  3 Atlanta, GA (2016) New Orleans, LA  3709 0.424 
##  4 Atlanta, GA (2016) San Diego, CA     870 0.0994
##  5 Boston, MA (2016)  Chicago, IL      7710 0.891 
##  6 Boston, MA (2016)  Las Vegas, NV     430 0.0497
##  7 Boston, MA (2016)  New Orleans, LA   291 0.0336
##  8 Boston, MA (2016)  San Diego, CA     223 0.0258
##  9 Dallas, TX (2016)  Chicago, IL      2999 0.343 
## 10 Dallas, TX (2016)  Las Vegas, NV    1147 0.131 
## # ... with 94 more rows

Classifications are broadly as expected based on the previous archetype analysis. Variable importances are plotted:

helperPlotVarImp(rf_types_2015_2018_TDmcwha$rfModel)

Dew point and temperature by month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.
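helperPlotVarImp() is project-specific, but a bare-bones equivalent can be produced directly from a fitted randomForest object; the sketch below uses the built-in iris data as a stand-in for the METAR subset:

# Fit a small illustrative forest on built-in data
library(randomForest)
set.seed(42)
rf_demo <- randomForest(Species ~ ., data=iris, ntree=100)

# Mean decrease in Gini impurity, sorted from most to least important
imp <- randomForest::importance(rf_demo)
imp[order(-imp[, "MeanDecreaseGini"]), , drop=FALSE]

# The package also offers a ready-made dot plot
varImpPlot(rf_demo)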

An assessment can be run for the 2015-2018 model:

# Run for the full model including SLP
probs_2015_2018_TDmcwha <- 
    assessPredictionCertainty(rf_types_2015_2018_TDmcwha, 
                              keyVar="city", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP", 
                              showAcc=TRUE
                              )
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keyVar)` instead of `keyVar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(pkVars)` instead of `pkVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

  • Predictions with 80%+ of the votes are made ~75% of the time, and these predictions are ~99% accurate
  • Predictions with <80% of the votes are made ~25% of the time, and these predictions are ~80% accurate
  • The percentage of votes received appears to be a reasonable proxy for the confidence of the prediction
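The votes-as-confidence idea can be sketched outside the project helpers. Assuming the randomForest package, type="prob" predictions return the fraction of trees voting for each class, and the winning fraction can be banded at the 80% threshold used above (iris again stands in for the METAR data):

library(randomForest)
library(dplyr)

# Train/test split on built-in data
set.seed(42)
idx <- sample(nrow(iris), 100)
rf_demo <- randomForest(Species ~ ., data=iris[idx, ], ntree=100)
testData <- iris[-idx, ]

# Vote fractions and hard predictions for the held-out rows
votes <- predict(rf_demo, testData, type="prob")
predCls <- predict(rf_demo, testData, type="response")

# Accuracy within each confidence band
tibble(correct=predCls == testData$Species,
       topVote=apply(votes, 1, max)) %>%
    mutate(band=if_else(topVote >= 0.8, "80%+ votes", "<80% votes")) %>%
    group_by(band) %>%
    summarise(n=n(), accuracy=mean(correct))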

A similar process can be run for assessing the classification of the other cities against the 2015-2018 data for Chicago, Las Vegas, New Orleans, and San Diego:

useData <- metarData %>%
    filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))) %>%
    mutate(hr=lubridate::hour(dtime))
    
# Run against the full model (including SLP) for the remaining cities
probs_allcities_2015_2018_TDmcwh <- 
    assessPredictionCertainty(rf_types_2015_2018_TDmcwha, 
                              testData=useData,
                              keyVar="locale", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, modSLP", 
                              showHists=TRUE
                              )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The model is often less confident when assigning an archetype to these related cities, though it frequently makes the most sensible assignment.

An assessment is run to look at the evolution of the model as it takes in the ‘next best’ variable for building out the random forest:

# Define the variables to be considered
possVars <- c("TempF", "DewF", "month", "hr", "minHeight", 
              "ceilingHeight", "WindSpeed", "predomDir", "modSLP"
              )

# Create a container for storing the best random forest and variable added from each run
rfContainer <- vector("list", length=length(possVars))
rfVarAdds <- vector("character", length=length(possVars))

# Run the functions using a for loop over the length of possVars
for (ctr in seq_along(possVars)) {
    
    # Pull in the results of the previous run
    if (ctr==1) {
        prevVars <- c()
    } else {
        prevVars <- rfVarAdds[1:(ctr-1)]
    }
    
    # Run each of them through the combinations
    tmpList <- helperRFCombinations(possVars, df=sub_2015_2018_data, prevVars=prevVars)

    # Assess the performance
    tmpAccuracy <- helperVariableAccuracy(tmpList, possVars=possVars, prevVars=prevVars)
    
    # Prepare to repeat the process
    bestRow <- tmpAccuracy %>%
        filter(locale=="OOB") %>%
        filter(accuracy==max(accuracy))
    bestRow
    
    # Update the rfContainer and rfVarAdds elements
    rfContainer[[ctr]] <- tmpList[[as.integer(pull(bestRow, vrblNum))]]
    rfVarAdds[ctr] <- pull(bestRow, vrblName)
    cat("\nVariable Added:", rfVarAdds[ctr], "\n")

}
## 
## Will run for: TempF DewF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be:  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: DewF 
## 
## Will run for: TempF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: TempF 
## 
## Will run for: month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: month 
## 
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF month 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: modSLP 
## 
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir
## Fixed variable(s) will be: DewF TempF month modSLP 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: predomDir 
## 
## Will run for: hr minHeight ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: minHeight 
## 
## Will run for: hr ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: hr 
## 
## Will run for: ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: WindSpeed 
## 
## Will run for: ceilingHeight
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr WindSpeed 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"

## 
## Variable Added: ceilingHeight

The evolution of accuracy can then be plotted:

# Pull the accuracy data from the variables selected
tblAccuracy <- map_dfr(rfContainer, .f=helperExtractAccuracy, .id="vrblNum") %>%
    mutate(vrblName=rfVarAdds[as.integer(vrblNum)], 
           locale=factor(locale, levels=c("OOB", unique(locale)[unique(locale) != "OOB"])), 
           plotLabel=ifelse(as.integer(vrblNum)==1, vrblName, paste0("add ", vrblName))
           )

# Plot the gains in accuracy, facetted by 'locale'
ggplot(tblAccuracy, aes(x=fct_reorder(plotLabel, -as.integer(vrblNum)))) + 
    geom_text(aes(y=accuracy, label=paste0(round(100*accuracy), "%"))) + 
    facet_wrap(~locale, nrow=1) + 
    ylim(c(0, 1)) +
    geom_hline(yintercept=0.25, lty=2) +
    labs(x="", 
         y="", 
         title="Evolution of accuracy as next-best variables are added",
         caption="Null accuracy is 25%"
         ) +
    coord_flip()

As seen previously, dew point, temperature, and month are significant differentiating variables, driving roughly 75% accuracy in classifying the archetypes.

Adding modified sea-level pressure (modSLP) bumps the accuracy by another ~10%, while adding minimum cloud height and prevailing wind direction bumps the accuracy by another ~5%.

This shows some of the power of the random forest algorithm as it is given additional variables to explore. The evolution of error can be plotted to see the impact:

oobError <- c()
for (ctr in 1:9) {
    oobError <- c(oobError, rfContainer[[ctr]]$rfModel$err.rate[, "OOB"])
}

tibble::tibble(nVars=rep(1:9, each=25), 
               ntrees=rep(1:25, times=9), 
               accuracy=1-oobError
               ) %>%
    ggplot(aes(x=ntrees, y=accuracy)) + 
    geom_line(aes(group=nVars, color=factor(nVars))) +
    labs(x="Number of Trees", 
         y="OOB Accuracy", 
         title="Accuracy improves with more trees and more variables"
         ) + 
    scale_color_discrete("# Variables")

With a greater number of variables, there is a greater lift in accuracy as the forest grows larger.

Suppose that an attempt is made to classify the year for a single city. Example code includes:

# Create a subset of the data for only Chicago (2014-2019)
sub_kord_data <- metarData %>%
    mutate(hr=lubridate::hour(dtime)) %>%
    filter(str_sub(source, 1, 4) %in% "kord")

# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwha <- rfMultiLocale(sub_kord_data, 
                                 vrbls=c("TempF", "DewF", 
                                         "month", "hr",
                                         "minHeight", "ceilingHeight", 
                                         "WindSpeed", "predomDir", 
                                         "modSLP"
                                         ),
                                 locs=NULL, 
                                 locVar="year",
                                 pred="year",
                                 ntree=200, 
                                 seed=2007011309, 
                                 mtry=4
                                 )
## 
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="year", 
                locOrder=TRUE
                )

## # A tibble: 36 x 5
##    locale predicted correct     n    pct
##    <fct>  <fct>     <lgl>   <int>  <dbl>
##  1 2014   2014      TRUE     2016 0.798 
##  2 2014   2015      FALSE     136 0.0539
##  3 2014   2016      FALSE      85 0.0337
##  4 2014   2017      FALSE     101 0.04  
##  5 2014   2018      FALSE     102 0.0404
##  6 2014   2019      FALSE      85 0.0337
##  7 2015   2014      FALSE     126 0.048 
##  8 2015   2015      TRUE     2033 0.774 
##  9 2015   2016      FALSE     145 0.0552
## 10 2015   2017      FALSE     103 0.0392
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwha$rfModel)

# Evaluate error evolution
errorEvolution(rf_kord_TDmcwha, useCategory="OOB", subT="Chicago (2014-2019)")

## # A tibble: 200 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.491
##  2     2 OOB      0.488
##  3     3 OOB      0.489
##  4     4 OOB      0.476
##  5     5 OOB      0.469
##  6     6 OOB      0.454
##  7     7 OOB      0.442
##  8     8 OOB      0.428
##  9     9 OOB      0.418
## 10    10 OOB      0.408
## # ... with 190 more rows

Interestingly, the model predicts the year with 75%-80% accuracy (null accuracy ~17%), suggesting there is significant annual variation in the Chicago climate. Sea-level pressure has the largest variable importance (SLP blends altitude, temperature, dew point, and high/low-pressure systems). The first 50 trees significantly reduce the OOB error, with more modest, but still continuing, error reduction afterwards.
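
The null-accuracy figure follows directly from the six roughly balanced year classes (2014-2019); a classifier guessing a single year would be right about one time in six:

```r
# Null accuracy for six roughly balanced classes (2014-2019)
nClasses <- 6
round(1 / nClasses, 3)   # 0.167, i.e. ~17%
```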

The process is run for Las Vegas:

# Create a subset of the data for only Las Vegas (2014-2019)
sub_klas_data <- metarData %>%
    mutate(hr=lubridate::hour(dtime)) %>%
    filter(str_sub(source, 1, 4) %in% "klas")

# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwha <- rfMultiLocale(sub_klas_data, 
                                 vrbls=c("TempF", "DewF", 
                                         "month", "hr",
                                         "minHeight", "ceilingHeight", 
                                         "WindSpeed", "predomDir", 
                                         "modSLP"
                                         ),
                                 locs=NULL, 
                                 locVar="year",
                                 pred="year",
                                 ntree=200, 
                                 seed=2007011405, 
                                 mtry=4
                                 )
## 
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="year", 
                locOrder=TRUE
                )

## # A tibble: 36 x 5
##    locale predicted correct     n    pct
##    <fct>  <fct>     <lgl>   <int>  <dbl>
##  1 2014   2014      TRUE     1885 0.730 
##  2 2014   2015      FALSE     175 0.0678
##  3 2014   2016      FALSE     142 0.0550
##  4 2014   2017      FALSE     149 0.0577
##  5 2014   2018      FALSE     109 0.0422
##  6 2014   2019      FALSE     121 0.0469
##  7 2015   2014      FALSE     197 0.0757
##  8 2015   2015      TRUE     1711 0.658 
##  9 2015   2016      FALSE     162 0.0623
## 10 2015   2017      FALSE     169 0.0650
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwha$rfModel)

# Evaluate error evolution
errorEvolution(rf_klas_TDmcwha, useCategory="OOB", subT="Las Vegas (2014-2019)")

## # A tibble: 200 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.576
##  2     2 OOB      0.569
##  3     3 OOB      0.565
##  4     4 OOB      0.560
##  5     5 OOB      0.554
##  6     6 OOB      0.548
##  7     7 OOB      0.536
##  8     8 OOB      0.527
##  9     9 OOB      0.518
## 10    10 OOB      0.511
## # ... with 190 more rows

Prediction accuracies are lower for Las Vegas, averaging 60%-75% (2014 appears much more differentiated than the other years). This suggests there is less annual variation in the Las Vegas climate than in the Chicago climate, though still enough to meaningfully pull apart the years.

Modified sea-level pressure is the top predictor, and the majority of the OOB error reduction occurs in the first 50 trees.

Suppose that modSLP is eliminated as a predictor variable, and the process is scaled down to 50 trees and repeated for Chicago:

# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwh_50 <- rfMultiLocale(sub_kord_data, 
                                   vrbls=c("TempF", "DewF", 
                                           "month", "hr",
                                           "minHeight", "ceilingHeight", 
                                           "WindSpeed", "predomDir"
                                           ),
                                   locs=NULL, 
                                   locVar="year",
                                   pred="year",
                                   ntree=50, 
                                   seed=2007011410, 
                                   mtry=4
                                   )
## 
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwh_50, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind", 
                keyVar="year", 
                locOrder=TRUE
                )

## # A tibble: 36 x 5
##    locale predicted correct     n    pct
##    <fct>  <fct>     <lgl>   <int>  <dbl>
##  1 2014   2014      TRUE     1621 0.629 
##  2 2014   2015      FALSE     194 0.0753
##  3 2014   2016      FALSE     175 0.0679
##  4 2014   2017      FALSE     217 0.0842
##  5 2014   2018      FALSE     197 0.0764
##  6 2014   2019      FALSE     174 0.0675
##  7 2015   2014      FALSE     240 0.0901
##  8 2015   2015      TRUE     1554 0.583 
##  9 2015   2016      FALSE     238 0.0893
## 10 2015   2017      FALSE     196 0.0736
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwh_50$rfModel)

# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50, 
               withSLP=rf_kord_TDmcwha$rfModel$err.rate[1:50, "OOB"], 
               noSLP=rf_kord_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
               ) %>%
    pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
    ggplot(aes(x=ntree, y=Error, color=Model)) + 
    geom_line() + 
    geom_text(data=~filter(., ntree %in% c(1, 50)), 
              aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
              ) +
    labs(x="# Trees", 
         y="OOB Error Rate", 
         title="OOB Error Evolution for SLP Included/Excluded"
         ) + 
    ylim(c(0, NA))

Modified sea-level pressure is revealed to be a significant driver of prediction accuracy for Chicago, confirming the earlier variable-importance findings. This suggests that year-to-year differences in which high- and low-pressure patterns dominate meaningfully improve the ability to predict the year for Chicago.

The process can also be run for Las Vegas:

# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwh_50 <- rfMultiLocale(sub_klas_data, 
                                   vrbls=c("TempF", "DewF", 
                                           "month", "hr",
                                           "minHeight", "ceilingHeight", 
                                           "WindSpeed", "predomDir"
                                           ),
                                   locs=NULL, 
                                   locVar="year",
                                   pred="year",
                                   ntree=50, 
                                   seed=2007011420, 
                                   mtry=4
                                   )
## 
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwh_50, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind", 
                keyVar="year", 
                locOrder=TRUE
                )

## # A tibble: 36 x 5
##    locale predicted correct     n    pct
##    <fct>  <fct>     <lgl>   <int>  <dbl>
##  1 2014   2014      TRUE     1610 0.630 
##  2 2014   2015      FALSE     258 0.101 
##  3 2014   2016      FALSE     177 0.0693
##  4 2014   2017      FALSE     194 0.0759
##  5 2014   2018      FALSE     148 0.0579
##  6 2014   2019      FALSE     168 0.0658
##  7 2015   2014      FALSE     298 0.115 
##  8 2015   2015      TRUE     1338 0.516 
##  9 2015   2016      FALSE     228 0.0880
## 10 2015   2017      FALSE     227 0.0876
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwh_50$rfModel)

# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50, 
               withSLP=rf_klas_TDmcwha$rfModel$err.rate[1:50, "OOB"], 
               noSLP=rf_klas_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
               ) %>%
    pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
    ggplot(aes(x=ntree, y=Error, color=Model)) + 
    geom_line() + 
    geom_text(data=~filter(., ntree %in% c(1, 50)), 
              aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
              ) +
    labs(x="# Trees", 
         y="OOB Error Rate", 
         title="OOB Error Evolution for SLP Included/Excluded"
         ) + 
    ylim(c(0, NA))

Modified sea-level pressure is also revealed as a meaningful driver of prediction accuracy for Las Vegas, though with a smaller effect than in Chicago, potentially suggesting less year-to-year variability in SLP for Las Vegas.

Since SLP is so meaningful, are there any patterns by year?

sub_kord_data %>%
    bind_rows(sub_klas_data, .id="cityFile") %>%
    select(cityFile, year, month, modSLP) %>%
    mutate(city=case_when(cityFile==1 ~ "Chicago", cityFile==2 ~ "Las Vegas", TRUE ~ "Error")) %>%
    filter(!is.na(modSLP)) %>%
    ggplot(aes(x=factor(year), y=modSLP)) + 
    geom_boxplot(fill="lightblue") + 
    facet_wrap(~city) +
    labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Year (Chicago and Las Vegas)")

Chicago has a higher sea-level pressure than Las Vegas (as expected), though neither city shows much variation on average from year to year. The Chicago data are explored further by month:

sub_kord_data %>%
    select(year, month, modSLP) %>%
    filter(!is.na(modSLP)) %>%
    ggplot(aes(x=month, y=modSLP)) + 
    geom_boxplot(fill="lightblue") + 
    facet_wrap(~year) +
    labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Month and Year (Chicago)")

sub_kord_data %>%
    select(year, month, modSLP) %>%
    filter(!is.na(modSLP)) %>%
    group_by(year, month) %>%
    summarize(medSLP=median(modSLP)) %>%
    ggplot(aes(x=month, y=medSLP, color=factor(year), group=year)) + 
    geom_line(lwd=1) + 
    labs(x="", y="Median modSLP", title="Median modified Sea-Level Pressure by Month and Year (Chicago)")

sub_kord_data %>%
    select(year, month, modSLP) %>%
    filter(!is.na(modSLP)) %>%
    group_by(year, month) %>%
    summarize(medSLP=median(modSLP)) %>%
    group_by(month) %>%
    summarize(sdSLP=sd(medSLP)) %>%
    ggplot(aes(x=month, y=sdSLP)) + 
    geom_point() + 
    labs(x="", 
         y="Standard Deviation of Median modSLP by Year", 
         title="Variability by year of modified Sea-Level Pressure by Month (Chicago)"
         )

There are meaningful differences in median modified sea-level pressure by month and year. This suggests that several years of data are needed to define an archetype; otherwise there is a risk that the archetype captures an anomalous “high pressure was in control in June” that is not generally applicable. And, since high vs. low pressure is often related to temperature, dew point, cloudiness, wind speed, and the like, the need for multiple years of data to define an archetype likely extends to all the other variables as well.
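
As a sketch of how that check might extend beyond modSLP, the same grouped-median approach can be applied to several variables at once. This assumes the document's sub_kord_data columns; TempF and WindSpeed are arbitrary example picks:

```r
library(dplyr)
library(tidyr)

# Sketch: across-year variability of monthly medians, computed per variable.
# Assumes sub_kord_data as defined earlier; TempF and WindSpeed are examples.
sub_kord_data %>%
    select(year, month, TempF, WindSpeed, modSLP) %>%
    pivot_longer(c(TempF, WindSpeed, modSLP), names_to="variable", values_to="value") %>%
    filter(!is.na(value)) %>%
    group_by(variable, year, month) %>%
    summarize(medValue=median(value), .groups="drop") %>%
    group_by(variable, month) %>%
    summarize(sdMedByYear=sd(medValue), .groups="drop")
```

Variables whose monthly medians vary widely across years would, like modSLP, argue for pooling multiple years before defining an archetype.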

With greater variability in SLP by year in the cold-weather months, predictive ability in the cold-weather months is expected to be higher:

rf_kord_TDmcwha$testData %>% 
    group_by(month) %>% 
    summarize(pctError=1-mean(correct)) %>%
    ggplot(aes(x=month, y=pctError)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=pctError/2, label=paste0(round(100*pctError, 1), "%"))) +
    labs(x="", y="Error Rate", title="Prediction Error Rate by Month (Chicago)")

As expected, the model exploits the greater annual variation in modSLP during the cold season to make better predictions of the year for cold-season observations.